The purpose of this first individual project is to apply data analysis techniques to two given datasets. These techniques help us understand what the data represents and visualize the characteristics we are most interested in. The project requires a version control system and R Markdown. Together these two tools make the analysis reproducible, which is one of the main principles a program should follow. The version control system used for the assignment is ‘Git’.
Throughout the project we work with two specific datasets (CSV files).
These two datasets contain the number of bikes rented on specific dates and times in each of the two cities. They also provide other variables that might influence the number of rented bikes, for example the temperature, wind speed and humidity at the time. These conditions can raise or lower the number of rented bikes.
The two CSV files start out messy: values are in different units of measurement and the column names do not describe the attributes well. To make sure our calculations, comparisons and understanding of the data are accurate, we first apply a technique called data wrangling, which brings the data into a “tidy” (clean) state.
Once the CSV files are tidy, we use visualizations to compare bike rentals in Washington, DC (USA) with those in Seoul (South Korea), and to examine the influence of each variable on the number of rented bikes.
Lastly, we apply statistical analysis methods that are useful for prediction. They also let us examine whether the variables and the data are reliable enough to base future projections on.
These techniques are applied to the data in the two CSV files:
library(tidyverse) ## Provides %>%, filter(), select(), rename(), mutate() and ggplot2.
seoul_Bikes <- read.csv("BikeSeoul.csv")
head(seoul_Bikes, 4)
## Date Rented.Bike.Count Hour Temperature.C. Humidity... Wind.speed..m.s.
## 1 01/12/2017 254 0 -5.2 37 2.2
## 2 01/12/2017 204 1 -5.5 38 0.8
## 3 01/12/2017 173 2 -6.0 39 1.0
## 4 01/12/2017 107 3 -6.2 40 0.9
## Visibility..10m. Dew.point.temperature.C. Solar.Radiation..MJ.m2.
## 1 2000 -17.6 0
## 2 2000 -17.6 0
## 3 2000 -17.7 0
## 4 2000 -17.6 0
## Rainfall.mm. Snowfall..cm. Seasons Holiday Functioning.Day
## 1 0 0 Winter No Holiday Yes
## 2 0 0 Winter No Holiday Yes
## 3 0 0 Winter No Holiday Yes
## 4 0 0 Winter No Holiday Yes
Above we can see the initial state of the file (“BikeSeoul.csv”). There are some issues with the structure of the provided data set. First, we remove the records where Functioning.Day is "No" and then drop the Functioning.Day column.
## Number of rows before removing unneeded records are:
nrow(seoul_Bikes)
## [1] 8760
seoul_Bikes <- seoul_Bikes %>%
filter(Functioning.Day != "No") %>% ## Filter out the rows with Functioning.Day equals "No"
select(-Functioning.Day) ## Remove column Functioning.Day
## Number of rows after removing unneeded records are:
nrow(seoul_Bikes)
## [1] 8465
From our first glimpse of the table we can notice that some column names are not appropriate. We need to rename the following columns.
## Header Names before Renaming
names(seoul_Bikes)
## [1] "Date" "Rented.Bike.Count"
## [3] "Hour" "Temperature.C."
## [5] "Humidity..." "Wind.speed..m.s."
## [7] "Visibility..10m." "Dew.point.temperature.C."
## [9] "Solar.Radiation..MJ.m2." "Rainfall.mm."
## [11] "Snowfall..cm." "Seasons"
## [13] "Holiday"
seoul_Bikes <- seoul_Bikes %>%
rename(Count = Rented.Bike.Count, ## Using function we are renaming the columns.
Temperature = Temperature.C.,
WindSpeed = Wind.speed..m.s.,
Season = Seasons,
Humidity = Humidity...)
## Header Names after Renaming
names(seoul_Bikes)
## [1] "Date" "Count"
## [3] "Hour" "Temperature"
## [5] "Humidity" "WindSpeed"
## [7] "Visibility..10m." "Dew.point.temperature.C."
## [9] "Solar.Radiation..MJ.m2." "Rainfall.mm."
## [11] "Snowfall..cm." "Season"
## [13] "Holiday"
## Printing class of column Date before mutation.
class(seoul_Bikes$Date)
## [1] "character"
seoul_Bikes <- seoul_Bikes %>%
convertDate("dmy") ## Need to specify the format of the date in the cell before converting.
## Printing class of column Date after mutation.
class(seoul_Bikes$Date)
## [1] "Date"
## Printing column names to show that FullDate does not exist.
names(seoul_Bikes)
## [1] "Date" "Count"
## [3] "Hour" "Temperature"
## [5] "Humidity" "WindSpeed"
## [7] "Visibility..10m." "Dew.point.temperature.C."
## [9] "Solar.Radiation..MJ.m2." "Rainfall.mm."
## [11] "Snowfall..cm." "Season"
## [13] "Holiday"
## Calling a self-declared function to undertake this task.
seoul_Bikes <- seoul_Bikes %>%
createFullDate()
## Printing column names to show creation of FullDate.
names(seoul_Bikes)
## [1] "Date" "Count"
## [3] "Hour" "Temperature"
## [5] "Humidity" "WindSpeed"
## [7] "Visibility..10m." "Dew.point.temperature.C."
## [9] "Solar.Radiation..MJ.m2." "Rainfall.mm."
## [11] "Snowfall..cm." "Season"
## [13] "Holiday" "FullDate"
## Printing first 6 rows of FullDate to show the value is in correct format.
head(seoul_Bikes$FullDate)
## [1] "2017-12-01 00:00:00 UTC" "2017-12-01 01:00:00 UTC"
## [3] "2017-12-01 02:00:00 UTC" "2017-12-01 03:00:00 UTC"
## [5] "2017-12-01 04:00:00 UTC" "2017-12-01 05:00:00 UTC"
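The report does not show the bodies of the self-declared helpers convertDate() and createFullDate(). A minimal sketch of what they might look like, using base R only, is given below. The function bodies, the two supported format names and the `demo` data frame are assumptions made for illustration, not the original implementation.

```r
# Hypothetical stand-ins for the report's self-declared helpers.
# Assumption: "dmy" means dd/mm/yyyy and "ymd" means yyyy-mm-dd,
# matching the date strings printed for the two CSV files.
convertDate <- function(df, order = c("dmy", "ymd")) {
  order <- match.arg(order)
  fmt <- switch(order, dmy = "%d/%m/%Y", ymd = "%Y-%m-%d")
  df$Date <- as.Date(df$Date, format = fmt)  # character -> Date
  df
}

# Combine the Date and Hour columns into one UTC date-time column.
createFullDate <- function(df) {
  midnight <- as.POSIXct(format(df$Date), tz = "UTC")  # 00:00:00 of each date
  df$FullDate <- midnight + df$Hour * 3600             # add the hour in seconds
  df
}

# Toy data in the same shape as the first rows of BikeSeoul.csv.
demo <- data.frame(Date = c("01/12/2017", "01/12/2017"),
                   Hour = c(0L, 1L),
                   stringsAsFactors = FALSE)
demo <- createFullDate(convertDate(demo, "dmy"))
print(class(demo$Date))
print(demo$FullDate)
```

Because both functions take the data frame as their first argument and return it, they can be used in a `%>%` pipeline in the same way as in the report.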
Additionally, we change the type/class of the Holiday column from ‘character’ to ‘factor’.
We apply this change so that we can separate the bikes rented during a holiday from those rented when it was not a holiday. The factor type lets us categorize data.
Implementation idea
Using the tidyverse function mutate() in combination with the ifelse() conditional, we change the record values to Yes / No. Once the values have changed, we convert the column to a factor with the level order Yes, No:
## Printing the first 6 rows to show the values before changing
head(seoul_Bikes$Holiday)
## [1] "No Holiday" "No Holiday" "No Holiday" "No Holiday" "No Holiday"
## [6] "No Holiday"
## Printing the class of Holiday to show it's a 'character'
class(seoul_Bikes$Holiday)
## [1] "character"
seoul_Bikes <- seoul_Bikes %>%
mutate(Holiday = ifelse(Holiday == "No Holiday", "No", "Yes")) %>% ## Changing values to Yes & No
mutate(Holiday = factor(Holiday, levels = c("Yes", "No"))) ## Changing type to factor with levels ordered Yes, No.
## Printing the first 6 rows to show factor values have changed.
head(seoul_Bikes$Holiday)
## [1] No No No No No No
## Levels: Yes No
## Printing the class of Holiday to show it's a 'factor'
class(seoul_Bikes$Holiday)
## [1] "factor"
## Printing class of the Season column to show it's not a factor yet.
class(seoul_Bikes$Season)
## [1] "character"
seoul_Bikes <- seoul_Bikes %>%
mutate(Season = factor(Season, levels = c("Spring", "Summer", "Autumn", "Winter")))
## Printing the first 6 rows to show the order of the factor levels after the change of class
head(seoul_Bikes$Season)
## [1] Winter Winter Winter Winter Winter Winter
## Levels: Spring Summer Autumn Winter
Remove the following columns:
## Printing the column names of the table before removing unwanted data.
names(seoul_Bikes)
## [1] "Date" "Count"
## [3] "Hour" "Temperature"
## [5] "Humidity" "WindSpeed"
## [7] "Visibility..10m." "Dew.point.temperature.C."
## [9] "Solar.Radiation..MJ.m2." "Rainfall.mm."
## [11] "Snowfall..cm." "Season"
## [13] "Holiday" "FullDate"
## Calling a self-declared function to undertake the given task.
seoul_Bikes <- seoul_Bikes %>%
deleteColumn( c("Visibility..10m.",
"Solar.Radiation..MJ.m2.",
"Rainfall.mm.",
"Dew.point.temperature.C.",
"Snowfall..cm."))
## Printing column names and first 6 rows to show the end result of the data that has been cleaned.
names(seoul_Bikes)
## [1] "Date" "Count" "Hour" "Temperature" "Humidity"
## [6] "WindSpeed" "Season" "Holiday" "FullDate"
head(seoul_Bikes)
## Date Count Hour Temperature Humidity WindSpeed Season Holiday
## 1 2017-12-01 254 0 -5.2 37 2.2 Winter No
## 2 2017-12-01 204 1 -5.5 38 0.8 Winter No
## 3 2017-12-01 173 2 -6.0 39 1.0 Winter No
## 4 2017-12-01 107 3 -6.2 40 0.9 Winter No
## 5 2017-12-01 78 4 -6.0 36 2.3 Winter No
## 6 2017-12-01 100 5 -6.4 37 1.5 Winter No
## FullDate
## 1 2017-12-01 00:00:00
## 2 2017-12-01 01:00:00
## 3 2017-12-01 02:00:00
## 4 2017-12-01 03:00:00
## 5 2017-12-01 04:00:00
## 6 2017-12-01 05:00:00
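The self-declared deleteColumn() helper used above is not shown in the report either. One way it could be written is sketched below in base R. The body, the error message and the `demo` data frame are assumptions for illustration only.

```r
# Hypothetical stand-in for the report's self-declared deleteColumn() helper.
# Drops the named columns and fails loudly if a name is misspelled.
deleteColumn <- function(df, columns) {
  unknown <- setdiff(columns, names(df))
  if (length(unknown) > 0) {
    stop("Columns not found: ", paste(unknown, collapse = ", "))
  }
  df[, setdiff(names(df), columns), drop = FALSE]  # keep everything else
}

# Toy data with two of the column names printed by names(seoul_Bikes).
demo <- data.frame(Date = "2017-12-01", Count = 254,
                   Rainfall.mm. = 0, Snowfall..cm. = 0,
                   stringsAsFactors = FALSE)
demo <- deleteColumn(demo, c("Rainfall.mm.", "Snowfall..cm."))
print(names(demo))
```

Stopping on unknown names is a design choice in this sketch: a typo in a long column name such as "Solar.Radiation..MJ.m2." would otherwise leave the column in place without any warning.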
We have successfully wrangled the first file (Seoul). Now we need to apply data wrangling to the second file (Washington). The conventions we apply should bring both files into a compatible format: column names and units of measurement must be the same so that we can compare the two data files.
washington_Bikes <- read.csv("BikeWashingtonDC.csv")
head(washington_Bikes)
## instant dteday season yr mnth hr holiday weekday workingday weathersit
## 1 1 2011-01-01 1 0 1 0 0 6 0 1
## 2 2 2011-01-01 1 0 1 1 0 6 0 1
## 3 3 2011-01-01 1 0 1 2 0 6 0 1
## 4 4 2011-01-01 1 0 1 3 0 6 0 1
## 5 5 2011-01-01 1 0 1 4 0 6 0 1
## 6 6 2011-01-01 1 0 1 5 0 6 0 2
## temp atemp hum windspeed casual registered cnt
## 1 0.24 0.2879 0.81 0.0000 3 13 16
## 2 0.22 0.2727 0.80 0.0000 8 32 40
## 3 0.22 0.2727 0.80 0.0000 5 27 32
## 4 0.24 0.2879 0.75 0.0000 3 10 13
## 5 0.24 0.2879 0.75 0.0000 0 1 1
## 6 0.24 0.2576 0.75 0.0896 0 1 1
Above we can see the initial state of the file (“BikeWashingtonDC.csv”). There are some issues with the structure of the provided data set.
* Remove the following columns: unique record index, year, month, day of the week, working day, weather condition, normalised feeling temperature and number of bikes rented by casual and registered users (i.e. keep only the total count).
We will need to remove the following:
+ instant (unique number for each record)
+ yr (year)
+ mnth (month)
+ weekday (day of the week)
+ workingday (whether the day is a working day, i.e. neither a weekend nor a holiday)
+ weathersit (weather condition)
+ atemp (normalized feeling temperature)
+ casual (number of bikes rented by casual users)
+ registered (number of bikes rented by registered users)
## Columns in the file before removing
names(washington_Bikes)
## [1] "instant" "dteday" "season" "yr" "mnth"
## [6] "hr" "holiday" "weekday" "workingday" "weathersit"
## [11] "temp" "atemp" "hum" "windspeed" "casual"
## [16] "registered" "cnt"
## Calling self-declared function to undertake given task
washington_Bikes <- washington_Bikes %>%
deleteColumn( c("instant",
"yr",
"mnth",
"weekday",
"workingday",
"weathersit",
"atemp",
"casual",
"registered"))
## Columns in file after removing.
names(washington_Bikes)
## [1] "dteday" "season" "hr" "holiday" "temp" "hum"
## [7] "windspeed" "cnt"
* Change the names of the columns to match the ones for Seoul. We need to rename the following columns:
+ ‘dteday’ to ‘Date’
+ ‘cnt’ to ‘Count’
+ ‘hr’ to ‘Hour’
+ ‘temp’ to ‘Temperature’
+ ‘hum’ to ‘Humidity’
+ ‘windspeed’ to ‘WindSpeed’
+ ‘season’ to ‘Season’
+ ‘holiday’ to ‘Holiday’
The column names must match exactly, including letter case, if we would like to join the tables later.
## Header Names before Renaming
names(washington_Bikes)
## [1] "dteday" "season" "hr" "holiday" "temp" "hum"
## [7] "windspeed" "cnt"
washington_Bikes <- washington_Bikes %>%
rename(Count = cnt, ## Using function we are renaming the columns.
Temperature = temp,
WindSpeed = windspeed,
Season = season,
Humidity = hum,
Date = dteday,
Hour = hr,
Holiday = holiday)
## Header Names after Renaming
names(washington_Bikes)
## [1] "Date" "Season" "Hour" "Holiday" "Temperature"
## [6] "Humidity" "WindSpeed" "Count"
## Printing first 6 rows to show the values are currently decimals (fractions of 1)
head(washington_Bikes$Humidity)
## [1] 0.81 0.80 0.80 0.75 0.75 0.75
washington_Bikes <- washington_Bikes %>%
mutate(Humidity = Humidity * 100) ## multiply existing value with 100 to make a percentage value.
## Printing first 6 rows to show values out of a hundred (in %), smallest is 0 and largest 100.
head(washington_Bikes$Humidity)
## [1] 81 80 80 75 75 75
## Printing first 6 rows to show the values are currently normalized.
head(washington_Bikes$Temperature)
## [1] 0.24 0.22 0.22 0.24 0.24 0.24
Tmin <- -8
Tmax <- 39
washington_Bikes <- washington_Bikes %>%
mutate(Temperature = Temperature * (Tmax - Tmin) + Tmin) ## Undo the min-max normalization to get degrees Celsius.
## Printing first 6 rows to show values are now in degrees Celsius.
head(washington_Bikes$Temperature)
## [1] 3.28 2.34 2.34 3.28 3.28 3.28
## Printing first 6 rows to show the values are in (km/h)/69
head(washington_Bikes$WindSpeed)
## [1] 0.0000 0.0000 0.0000 0.0000 0.0000 0.0896
makeToKM <- 69
multiplayerConstant <- 0.2777778
washington_Bikes <- washington_Bikes %>%
mutate(WindSpeed = WindSpeed * makeToKM * multiplayerConstant) ## Multiply by 69 to get km/h, then by 0.2777778 (= 1000/3600) to convert to m/s.
## Printing first 6 rows to show values are now in m/s
head(washington_Bikes$WindSpeed)
## [1] 0.000000 0.000000 0.000000 0.000000 0.000000 1.717333
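The two unit conversions above can be checked by hand. The sketch below wraps them in two small functions and reproduces the first printed temperature and the sixth printed wind speed. The function names `denormalise` and `kmh_to_ms` are made up for this example. The constants -8, 39 and 69 are the ones used in the report.

```r
# Undo a min-max normalisation: x_norm = (x - min) / (max - min).
denormalise <- function(x_norm, min_value, max_value) {
  x_norm * (max_value - min_value) + min_value
}

# Convert km/h to m/s: 1 km = 1000 m and 1 h = 3600 s.
kmh_to_ms <- function(kmh) {
  kmh * 1000 / 3600
}

# First temperature value printed above: 0.24 * (39 - (-8)) + (-8)
temperature_c <- denormalise(0.24, min_value = -8, max_value = 39)

# Sixth wind speed value printed above: 0.0896 * 69 km/h, then to m/s
wind_ms <- kmh_to_ms(0.0896 * 69)

print(temperature_c)
print(wind_ms)
```

The results agree with the values 3.28 and 1.717333 shown in the output of the report.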
## Printing the first 6 rows to show the values before changing
head(washington_Bikes$Season)
## [1] 1 1 1 1 1 1
## Printing the class of Season to show it's an 'integer'
class(washington_Bikes$Season)
## [1] "integer"
washington_Bikes <- washington_Bikes %>%
mutate(Season = ifelse(Season == 1, "Winter", Season)) %>% ## Changing values to the appropriate factors Winter | Summer | Autumn | Spring.
mutate(Season = ifelse(Season == 2, "Spring", Season)) %>%
mutate(Season = ifelse(Season == 3, "Summer", Season)) %>%
mutate(Season = ifelse(Season == 4, "Autumn", Season)) %>%
mutate(Season = factor(Season, levels = c("Spring", "Summer", "Autumn", "Winter"))) ## Changing type to factor with the same level order as Seoul.
## Printing the first 6 rows to show factor values have changed.
head(washington_Bikes$Season)
## [1] Winter Winter Winter Winter Winter Winter
## Levels: Spring Summer Autumn Winter
## Printing the class of Season to show it's a 'factor'
class(washington_Bikes$Season)
## [1] "factor"
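The chain of ifelse() calls above works because the column becomes a character vector after the first replacement. As an alternative sketch, the same recoding can be done in one step with a lookup vector indexed by the season code. The vector `season_codes` below is toy data. The mapping 1 = Winter, 2 = Spring, 3 = Summer, 4 = Autumn is the one used in the report.

```r
# Toy season codes in the same encoding as the Washington file.
season_codes <- c(1L, 2L, 3L, 4L, 1L)

# Position i of this vector holds the name for season code i.
code_to_name <- c("Winter", "Spring", "Summer", "Autumn")

# Index the lookup vector by code, then fix the level order used for Seoul.
Season <- factor(code_to_name[season_codes],
                 levels = c("Spring", "Summer", "Autumn", "Winter"))
print(Season)
```

Inside the pipeline this could be written as `mutate(Season = factor(code_to_name[Season], levels = ...))`, which avoids relying on the implicit integer-to-character conversion.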
## Printing the first 6 rows to show the values before changing
head(washington_Bikes$Holiday)
## [1] 0 0 0 0 0 0
## Printing the class of Holiday to show it's an 'integer'
class(washington_Bikes$Holiday)
## [1] "integer"
washington_Bikes <- washington_Bikes %>%
mutate(Holiday = ifelse(Holiday == 0, "No", "Yes")) %>% ## Changing values to Yes & No (0=No, 1=Yes).
mutate(Holiday = factor(Holiday, levels = c("Yes", "No"))) ## Changing type to factor with levels ordered Yes, No.
## Printing the first 6 rows to show factor values have changed.
head(washington_Bikes$Holiday)
## [1] No No No No No No
## Levels: Yes No
## Printing the class of Holiday to show it's a 'factor'
class(washington_Bikes$Holiday)
## [1] "factor"
## Printing class of column Date before mutation.
class(washington_Bikes$Date)
## [1] "character"
washington_Bikes <- washington_Bikes %>%
convertDate("ymd")
## Printing class of column Date after mutation.
class(washington_Bikes$Date)
## [1] "Date"
## Printing column names to show that FullDate does not exist.
names(washington_Bikes)
## [1] "Date" "Season" "Hour" "Holiday" "Temperature"
## [6] "Humidity" "WindSpeed" "Count"
## Calling a self-declared function to apply this task.
washington_Bikes <- washington_Bikes %>%
createFullDate()
## Printing column names to show creation of FullDate.
names(washington_Bikes)
## [1] "Date" "Season" "Hour" "Holiday" "Temperature"
## [6] "Humidity" "WindSpeed" "Count" "FullDate"
## Printing first 6 rows of FullDate to show the value is in correct format.
head(washington_Bikes$FullDate)
## [1] "2011-01-01 00:00:00 UTC" "2011-01-01 01:00:00 UTC"
## [3] "2011-01-01 02:00:00 UTC" "2011-01-01 03:00:00 UTC"
## [5] "2011-01-01 04:00:00 UTC" "2011-01-01 05:00:00 UTC"
head(washington_Bikes)
## Date Season Hour Holiday Temperature Humidity WindSpeed Count
## 1 2011-01-01 Winter 0 No 3.28 81 0.000000 16
## 2 2011-01-01 Winter 1 No 2.34 80 0.000000 40
## 3 2011-01-01 Winter 2 No 2.34 80 0.000000 32
## 4 2011-01-01 Winter 3 No 3.28 75 0.000000 13
## 5 2011-01-01 Winter 4 No 3.28 75 0.000000 1
## 6 2011-01-01 Winter 5 No 3.28 75 1.717333 1
## FullDate
## 1 2011-01-01 00:00:00
## 2 2011-01-01 01:00:00
## 3 2011-01-01 02:00:00
## 4 2011-01-01 03:00:00
## 5 2011-01-01 04:00:00
## 6 2011-01-01 05:00:00
We have successfully wrangled the second file (Washington). Our two files are now compatible with each other: they share the same column names and each column uses the same unit of measurement.
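The compatibility of the two tables can also be checked in code rather than by eye. The sketch below is one possible check. The function name `compatible` and the two toy data frames are assumptions for illustration. In the report the same call would be made on seoul_Bikes and washington_Bikes.

```r
# Hypothetical check: same set of column names (order may differ) and
# identical factor levels for the listed categorical columns.
compatible <- function(a, b, factor_cols) {
  same_names <- setequal(names(a), names(b))
  same_levels <- all(vapply(
    factor_cols,
    function(col) identical(levels(a[[col]]), levels(b[[col]])),
    logical(1)
  ))
  same_names && same_levels
}

season_levels <- c("Spring", "Summer", "Autumn", "Winter")

# Toy tables with the columns in a different order, as in the report.
seoul_demo <- data.frame(Count = 254,
                         Season = factor("Winter", levels = season_levels))
wash_demo <- data.frame(Season = factor("Winter", levels = season_levels),
                        Count = 16)

print(compatible(seoul_demo, wash_demo, factor_cols = "Season"))
```

The check compares names as a set because the two cleaned tables list their columns in different orders.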
We will now proceed to the data visualization tasks.
We will apply some visualization and statistical analysis to compare the air temperature of both locations.
Because my personal laptop is not powerful enough, I will use only ggplot2 for the visualisations rather than ggplotly. We will see a point plot and boxplots for both datasets.
## aligns the plot/figure in the center
## height/width gives size in inches
ggplot(seoul_Bikes) +
geom_point(aes(x = Date, y = Temperature), col="dark grey") +
stat_smooth(aes(x = Date, y = Temperature)) + ## Smoothed trend line through the points
xlab("Date") + ## Naming x and y axes.
ylab("Air Temperature (degrees celsius)") +
ggtitle("Air Temperature variation of Seoul, South Korea") + ## Adding title to graph
theme(plot.title = element_text(hjust = 0.5)) ## To align the title of the graph in the center, code found on stackoverflow: https://stackoverflow.com/questions/40675778/center-plot-title-in-ggplot2
Using the point plot and stat_smooth() we can view how the air temperature varies over the year in Seoul, South Korea. Temperatures vary widely: the warmest months are between May and August, and it gets colder on either side of them.
## The mean average of air temperature.
seoul_Bikes %>%
summarise(Mean=mean(Temperature))
## Mean
## 1 12.77106
## The hottest day
seoul_Bikes %>%
summarise(Maximum=max(Temperature))
## Maximum
## 1 39.4
## The coldest temperature.
seoul_Bikes %>%
summarise(Minimum=min(Temperature))
## Minimum
## 1 -17.8
It can get very hot, up to nearly 40 degrees Celsius, but it can also get very cold, down to nearly -18 degrees Celsius.
ggplot(washington_Bikes) +
geom_point(aes(x = Date, y = Temperature), col = "orange") +
stat_smooth(aes(x = Date, y = Temperature)) +
xlab("Date") +
ylab("Air Temperature (degrees celsius)") +
ggtitle("Air Temperature variation of Washington, DC, America") +
theme(plot.title = element_text(hjust = 0.5))
Using the point plot and stat_smooth() we can view how the air temperature varies in Washington, DC. The pattern is similar to Seoul's. The curve rises and falls twice because the Washington data covers two years, whereas the Seoul data covers one year.
## The mean average of air temperature.
washington_Bikes %>%
summarise(Mean=mean(Temperature))
## Mean
## 1 15.3584
## The hottest day
washington_Bikes %>%
summarise(Maximum=max(Temperature))
## Maximum
## 1 39
## The coldest temperature.
washington_Bikes %>%
summarise(Minimum=min(Temperature))
## Minimum
## 1 -7.06
The average temperature in Washington is higher than in Seoul. The hottest temperatures of both locations are close (39.4 and 39 degrees Celsius), but Seoul's winter is a lot colder than Washington's: the minimum temperatures (-17.8 and -7.06 degrees Celsius) differ by more than 10 degrees.
ggplot(seoul_Bikes) +
geom_boxplot(aes(x = Season, y = Count), col = "dark grey") +
xlab("Season") +
ylab("Number of bikes rented") +
ggtitle("Bikes Rented per Season in Seoul, South Korea") +
theme(plot.title = element_text(hjust = 0.5))
From the boxplot above we can observe a significant drop in bikes rented in the winter. The highest renting season is summer, although autumn and spring are not far behind. We can conclude that in Seoul the number of bikes rented depends on the season.
ggplot(washington_Bikes) +
geom_boxplot(aes(x = Season, y = Count), col = "orange") +
xlab("Season") +
ylab("Number of bikes rented") +
ggtitle("Bikes Rented per Season in Washington, DC, America") +
theme(plot.title = element_text(hjust = 0.5))
From the boxplot above we can observe a significant drop in bikes rented in the winter in Washington as well. The highest renting season is again summer, although autumn and spring are not far behind. We can conclude that in Washington the number of bikes rented depends on the season.
Both locations have a dramatic fall in rented bikes in the winter. We can conclude that the number of bikes rented depends on the season.
ggplot(seoul_Bikes) +
geom_boxplot(aes(x = Holiday, y = Count), col = "dark grey") +
xlab("Holiday") +
ylab("Number of bikes rented") +
ggtitle("Bikes Rented per Holiday in Seoul, South Korea") +
theme(plot.title = element_text(hjust = 0.5))
ggplot(washington_Bikes) +
geom_boxplot(aes(x = Holiday, y = Count), col = "orange") +
xlab("Holiday") +
ylab("Number of bikes rented") +
ggtitle("Bikes Rented per Holiday in Washington, DC, America") +
theme(plot.title = element_text(hjust = 0.5))
By observing the two boxplots above, we can sum up that more bikes are rented when it is not a holiday. The difference is bigger in the Seoul dataset, where far more bikes are rented on non-holidays. In the Washington dataset the difference is smaller.
One possible explanation is that residents have to work when it is not a holiday and use bikes as their means of transportation.
grouped_seoul <- seoul_Bikes%>%
group_by(Hour) %>% ##Grouping data in groups per hour.
summarise(Average.Rented=mean(Count)) ## Finding average number of bikes rented per hour.
ggplot(grouped_seoul,aes(x = Hour, y = Average.Rented, fill = Hour)) +
geom_bar(stat = "identity") +
labs(fill = "Hour of Day") + ## The bars are mapped to fill, so the legend title is set through fill.
xlab("Hour of Day") +
ylab("Number of bikes rented") +
ggtitle("Average Bikes Rented per Hour in Seoul, South Korea") +
theme(plot.title = element_text(hjust = 0.5))
Using the bar graph we can see how busy each hour is on average in Seoul. The graph plots the average number of bikes rented in each hour of the day. Demand is lowest between 4 and 5 o’clock in the morning. There is a significant rise at 8 o’clock in the morning, after which demand falls again. Demand then rises gradually from 10 o’clock in the morning and reaches its peak at 18 o’clock, after which it starts dropping again.
Using the mean demand for rented bikes per hour, our conclusion is that the busiest hours of the day are 8 in the morning and 17-19 in the evening, with the single busiest hour at 18:00.
grouped_wash <- washington_Bikes%>%
group_by(Hour) %>% ##Grouping data in groups per hour.
summarise(Average.Rented=mean(Count)) ## Finding average number of bikes rented per hour.
ggplot(grouped_wash,aes(x = Hour, y = Average.Rented, fill = Hour)) +
geom_bar(stat = "identity") +
labs(fill = "Hour of Day") + ## The bars are mapped to fill, so the legend title is set through fill.
xlab("Hour of Day") +
ylab("Number of bikes rented") +
ggtitle("Average Bikes Rented per Hour in Washington, DC, America") +
theme(plot.title = element_text(hjust = 0.5))
Using the bar graph we can see how busy each hour is on average in Washington. The graph plots the average number of bikes rented in each hour of the day. Demand is lowest between 3 and 5 o’clock in the morning. There is a significant rise at 8 o’clock in the morning, after which demand falls again. Demand then rises gradually from 10 o’clock in the morning and reaches its peak at 17 o’clock and 18 o’clock, after which it starts dropping again.
Using the mean demand for rented bikes per hour, our conclusion is that the busiest hours of the day are 8 in the morning and 16-18 in the afternoon, with the peak at 17:00-18:00.
There is a similarity between both locations in how demand is distributed by hour. This might be because the population starts work at 8-9 in the morning and finishes work at 17-18. This could also be why there is more demand for bikes when it is not a holiday, since citizens need to go to work.
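The busiest hour can also be read from the grouped data instead of from the bar graph. A minimal sketch in base R is shown below. The `rentals` data frame holds made-up numbers purely for illustration and is not taken from either dataset.

```r
# Toy hourly rental records (made-up counts, two observations per hour).
rentals <- data.frame(
  Hour  = c(8, 8, 17, 17, 18, 18),
  Count = c(400, 500, 450, 470, 600, 620)
)

# Average count per hour, the same quantity the bar graphs display.
hourly <- aggregate(Count ~ Hour, data = rentals, FUN = mean)

# Hour with the highest average demand.
busiest_hour <- hourly$Hour[which.max(hourly$Count)]

print(hourly)
print(busiest_hour)
```

Applied to grouped_seoul or grouped_wash, the equivalent expression would be `Hour[which.max(Average.Rented)]` on the summarised table.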
ggplot(seoul_Bikes) +
geom_point(aes(x = Humidity, y = Temperature, size = Count, color = WindSpeed )) +
stat_smooth(aes(x = Humidity, y = Temperature)) +
xlab("Humidity") +
ylab("Temperature") +
ggtitle("Bikes rented based on the 3 meteorological variables, Seoul") +
theme(plot.title = element_text(hjust = 0.5))
The point plot above uses all three meteorological variables to examine whether the demand for rented bikes depends on them. The size of each point shows the number of bikes rented: the larger the point, the more bikes were rented. When Temperature goes over 10 degrees Celsius the number of bikes rented increases. The color of each point shows that once Wind Speed rises above 4 m/s the number of rented bikes decreases. The x and y axes also show that fewer bikes are rented when Temperature and Humidity are very low or very high. We do not know whether this observation is due to the combination of all variables or to just one of them, so below we review a graph for each meteorological attribute.
ggplot(seoul_Bikes) +
geom_point(aes(x = Humidity, y = Count, size = Count) , col ="red") +
stat_smooth(aes(x = Humidity, y = Count), method = "lm", col = "black") +
stat_smooth(aes(x = Humidity, y = Count), col = "blue") +
xlab("Humidity (%)") +
ylab("Number Rented Bikes") +
ggtitle("Bikes rented based on Humidity") +
theme(plot.title = element_text(hjust = 0.5))
The point plot above shows the relationship between the number of bikes rented and the level of humidity. The graph has two lines that reveal the relationship: the black line is a linear fit (method = "lm") and the blue line is the default smooth curve.
ggplot(seoul_Bikes) +
geom_point(aes(x = WindSpeed, y = Count, size = Count) , col ="green") +
stat_smooth(aes(x = WindSpeed, y = Count), method = "lm", col = "black") +
stat_smooth(aes(x = WindSpeed, y = Count), col = "blue") +
xlab("WindSpeed m/s") +
ylab("Number Rented Bikes") +
ggtitle("Bikes rented based on WindSpeed") +
theme(plot.title = element_text(hjust = 0.5))
The point plot above shows the relationship between the number of bikes rented and the wind speed. The graph has two lines that reveal the relationship: the black line is a linear fit (method = "lm") and the blue line is the default smooth curve.
ggplot(seoul_Bikes) +
geom_point(aes(x = Temperature, y = Count, size = Count) , col ="orange") +
stat_smooth(aes(x = Temperature, y = Count), method = "lm", col = "black") +
stat_smooth(aes(x = Temperature, y = Count), col = "blue") +
xlab("Temperature degrees Celsius") +
ylab("Number Rented Bikes") +
ggtitle("Bikes rented based on Temperature") +
theme(plot.title = element_text(hjust = 0.5))
The point plot above shows the relationship between the number of bikes rented and the temperature. The graph has two lines that reveal the relationship: the black line is a linear fit (method = "lm") and the blue line is the default smooth curve.
From the above graphs we can see that the number of bikes rented is associated with each meteorological variable. Far fewer bikes are rented when the wind speed is very strong or when the temperature or humidity is very low or very high.
The number of bikes rented is highest when the temperature is between 0-20 degrees Celsius, humidity is between 25-75% and wind speed is under 4 m/s.
#### Washington DC, America
ggplot(washington_Bikes) +
geom_point(aes(x = Humidity, y = Temperature, size = Count, color = WindSpeed )) +
stat_smooth(aes(x = Humidity, y = Temperature)) +
xlab("Humidity") +
ylab("Temperature") +
ggtitle("Bikes rented based on the 3 meteorological variables, Washington DC") +
theme(plot.title = element_text(hjust = 0.5))
The point plot above uses all three meteorological variables to examine whether the demand for rented bikes depends on them. The size of each point shows the number of bikes rented: the larger the point, the more bikes were rented. When Temperature goes over 10 degrees Celsius the number of bikes rented shows a small increase, not as big as in Seoul. The color of each point shows that once Wind Speed rises above 8-12 m/s the number of rented bikes decreases. The wind is stronger in Washington, and at a wind speed of 4 m/s the number of rented bikes is still high. On the x axis we can observe that when Humidity is low there are few records and the number of bikes rented is small. The y axis shows the relationship between the temperature and the number of rented bikes: the number of records is approximately the same across all temperature values, but when the temperature goes over 10 degrees Celsius the points grow in size, meaning that more bikes are rented. We do not know whether this observation is due to the combination of all variables or to just one of them, so below we review a graph for each meteorological attribute.
ggplot(washington_Bikes) +
geom_point(aes(x = Humidity, y = Count, size = Count) , col ="red") +
stat_smooth(aes(x = Humidity, y = Count), method = "lm", col = "black") +
stat_smooth(aes(x = Humidity, y = Count), col = "blue") +
xlab("Humidity (%)") +
ylab("Number Rented Bikes") +
ggtitle("Bikes rented based on Humidity") +
theme(plot.title = element_text(hjust = 0.5))
The point plot above shows the relationship between the number of bikes rented and the level of humidity. The graph has two lines that reveal the relationship: the black line is a linear fit (method = "lm") and the blue line is the default smooth curve.
ggplot(washington_Bikes) +
geom_point(aes(x = WindSpeed, y = Count, size = Count) , col ="green") +
stat_smooth(aes(x = WindSpeed, y = Count), method = "lm", col = "black") +
stat_smooth(aes(x = WindSpeed, y = Count), col = "blue") +
xlab("WindSpeed m/s") +
ylab("Number Rented Bikes") +
ggtitle("Bikes rented based on WindSpeed") +
theme(plot.title = element_text(hjust = 0.5))
The point plot above shows the relationship between the number of bikes rented and the wind speed. The graph has two lines that reveal the relationship: the black line is a linear fit (method = "lm") and the blue line is the default smooth curve.
ggplot(washington_Bikes) +
geom_point(aes(x = Temperature, y = Count, size = Count) , col ="orange") +
stat_smooth(aes(x = Temperature, y = Count), method = "lm", col = "black") +
stat_smooth(aes(x = Temperature, y = Count), col = "blue") +
xlab("Temperature degrees Celsius") +
ylab("Number Rented Bikes") +
ggtitle("Bikes rented based on Temperature") +
theme(plot.title = element_text(hjust = 0.5))
The point plot above shows the relationship between the number of bikes rented and the temperature. The graph has two lines that reveal the relationship: the black line is a linear fit (method = "lm") and the blue line is the default smooth curve.
From the above graphs we can see that the number of bikes rented is associated with each meteorological variable. Far fewer bikes are rented when the wind speed is very strong or when the humidity is very low. As temperature increases so does the number of bikes rented. Unlike in Seoul, where Temperature had a big influence on the number of bikes rented, there is no big fall in rentals across the temperature range in Washington.
In Washington the busiest weather conditions are a temperature between 20-30 degrees Celsius, a wind speed between 3-6 m/s and humidity between 25-75%.
linear_model_log_seoul <- lm(log(Count) ~ Season + Humidity + Temperature + WindSpeed,
data = seoul_Bikes)
summary(linear_model_log_seoul)
##
## Call:
## lm(formula = log(Count) ~ Season + Humidity + Temperature + WindSpeed,
## data = seoul_Bikes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1073 -0.4281 0.0812 0.5493 2.4352
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.7336965 0.0467062 144.171 < 2e-16 ***
## SeasonSummer 0.0036038 0.0327843 0.110 0.91247
## SeasonAutumn 0.3733211 0.0261578 14.272 < 2e-16 ***
## SeasonWinter -0.3830362 0.0349918 -10.946 < 2e-16 ***
## Humidity -0.0224974 0.0004844 -46.441 < 2e-16 ***
## Temperature 0.0492700 0.0015053 32.732 < 2e-16 ***
## WindSpeed 0.0253809 0.0093544 2.713 0.00668 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8276 on 8458 degrees of freedom
## Multiple R-squared: 0.4941, Adjusted R-squared: 0.4937
## F-statistic: 1377 on 6 and 8458 DF, p-value: < 2.2e-16
The summary describes the fitted model and the influence each variable has on Count (the number of rented bikes). Because the response is log(Count), every estimate is on the log scale.
* The Estimate column gives the gradient (slope) of each explanatory variable. A positive gradient means the response increases as the explanatory variable increases; a negative gradient means the response decreases as the explanatory variable increases.
* It follows that the number of rented bikes increases as Wind Speed and Temperature increase, and decreases as Humidity increases.
* Season “Spring” does not appear in the summary because Season is a categorical variable and “Spring” is its reference level. The estimates for “Summer”, “Autumn” and “Winter” are differences from “Spring”, so “Summer” has a gradient of 0.0036038 relative to “Spring”.
* The R-squared value is approximately 0.5 (0.4941), which means the explanatory variables together explain about 50% of the variance in log(Count). It does not tell us whether the explanatory variables are independent of each other. A value of 0.5 is neither high nor low; the fit is moderate.
* The Pr(>|t|) column shows which estimates are statistically distinguishable from zero. For this data Temperature, Humidity and Season (“Autumn” and “Winter”) have very small p-values and large t values, so they have a big influence on the outcome. Wind Speed is also significant, with weaker evidence (p = 0.00668), while “Summer” does not differ significantly from “Spring” (p = 0.91).
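Because the model is fitted to log(Count), a gradient can be read as an approximate percentage change in Count per one-unit increase of the variable. The sketch below illustrates this with three estimates copied from the Seoul output; the helper name `percent_change` is hypothetical and not part of the project code.

```r
# Estimates copied from the Seoul summary output (log(Count) scale).
seoul_coef <- c(Humidity    = -0.0224974,
                Temperature =  0.0492700,
                WindSpeed   =  0.0253809)

# Hypothetical helper: converts a log-scale gradient into the percentage
# change in Count for a one-unit increase of the explanatory variable,
# holding the other variables fixed.
percent_change <- function(beta) {
  100 * (exp(beta) - 1)
}

pct <- percent_change(seoul_coef)
print(round(pct, 2))
```

Read this way, one extra degree Celsius corresponds to roughly 5% more rented bikes in Seoul, and one extra percentage point of humidity to roughly 2% fewer.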
linear_model_log_wash <- lm(log(Count) ~ Season + Humidity + Temperature + WindSpeed,
data = washington_Bikes)
summary(linear_model_log_wash)
##
## Call:
## lm(formula = log(Count) ~ Season + Humidity + Temperature + WindSpeed,
## data = washington_Bikes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4834 -0.6069 0.2458 0.8440 3.5203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6264010 0.0576892 80.195 < 2e-16 ***
## SeasonSummer -0.3651680 0.0300276 -12.161 < 2e-16 ***
## SeasonAutumn 0.5361839 0.0289332 18.532 < 2e-16 ***
## SeasonWinter 0.1046103 0.0341346 3.065 0.00218 **
## Humidity -0.0233425 0.0005317 -43.901 < 2e-16 ***
## Temperature 0.0797914 0.0017401 45.856 < 2e-16 ***
## WindSpeed 0.0237920 0.0043072 5.524 3.37e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.263 on 17372 degrees of freedom
## Multiple R-squared: 0.278, Adjusted R-squared: 0.2777
## F-statistic: 1115 on 6 and 17372 DF, p-value: < 2.2e-16
We now explain the summary for the Washington, DC data.
* As in Seoul, the number of rented bikes increases as Wind Speed and Temperature increase, and decreases as Humidity increases.
* Season “Spring” does not appear in the summary because Season is a categorical variable and “Spring” is its reference level. The estimates for “Summer”, “Autumn” and “Winter” are differences from “Spring”, so “Summer” has a gradient of -0.3651680 relative to “Spring”.
* The R-squared value is approximately 0.3 (0.278), which means the explanatory variables together explain about 28% of the variance in log(Count). It does not tell us whether the explanatory variables are independent of each other. A value of 0.3 is low; many observations lie far from the fitted line.
* The Pr(>|t|) column shows that every explanatory variable is statistically significant. Judging by the t values, Temperature (45.856) and Humidity (-43.901) have the strongest influence on the outcome. Wind Speed is significant (p = 3.37e-08), but its t value (5.524) is much smaller.
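Wind Speed has a similar positive estimate in both cities (about 0.025 in Seoul and 0.024 in Washington), but the evidence for it is stronger in Washington (p = 3.37e-08) than in Seoul (p = 0.00668). In neither city is it the biggest influence on the response variable. Additionally, winter in Washington is relatively busier: the “Winter” estimate is slightly above “Spring” (0.10), whereas in Seoul it is well below “Spring” (-0.38).
In both cases the R-squared value is not high: it is moderate for Seoul and low for Washington.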
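To make the meaning of R-squared concrete, the sketch below computes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares. It uses R's built-in mtcars data as a stand-in, which is an assumption for illustration only; the numbers say nothing about the bike data.

```r
# Stand-in data: the built-in mtcars data set, NOT the bike data.
fit <- lm(log(mpg) ~ wt + hp, data = mtcars)

log_mpg <- log(mtcars$mpg)
rss <- sum(residuals(fit)^2)              # variation left unexplained by the model
tss <- sum((log_mpg - mean(log_mpg))^2)   # total variation in the response

r_squared_manual <- 1 - rss / tss

print(r_squared_manual)
print(summary(fit)$r.squared)             # value reported by summary()
```

Both printed values should agree, which shows that R-squared is the share of the response's variance that the model accounts for.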
We will examine the confidence intervals of the model coefficients at the 97% level.
A 97% confidence interval does not mean that future values will fall between the lower and upper limits. It means that if we repeated the sampling 100 times and computed the interval each time, about 97 of those intervals would contain the true value of the coefficient and about 3 would not.
confint(linear_model_log_seoul, level = 0.97)
## 1.5 % 98.5 %
## (Intercept) 6.632322686 6.83507030
## SeasonSummer -0.067553139 0.07476072
## SeasonAutumn 0.316546593 0.43009553
## SeasonWinter -0.458984431 -0.30708797
## Humidity -0.023548780 -0.02144592
## Temperature 0.046002904 0.05253719
## WindSpeed 0.005077663 0.04568421
confint(linear_model_log_wash, level = 0.97)
## 1.5 % 98.5 %
## (Intercept) 4.50119998 4.75160198
## SeasonSummer -0.43033590 -0.30000019
## SeasonAutumn 0.47339115 0.59897666
## SeasonWinter 0.03052896 0.17869159
## Humidity -0.02449639 -0.02218851
## Temperature 0.07601506 0.08356781
## WindSpeed 0.01444423 0.03313979
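The limits printed by confint() can be reproduced by hand as the estimate plus or minus a critical t value times the standard error. The sketch below does this for the Seoul Temperature coefficient, using numbers copied from the summary output; small differences in the last digits are expected because the printed values are rounded.

```r
# Values copied from the Seoul summary output.
estimate  <- 0.0492700   # Temperature estimate
std_error <- 0.0015053   # Temperature standard error
df        <- 8458        # residual degrees of freedom
level     <- 0.97

# Critical t value that leaves (1 - level) / 2 = 1.5% in each tail.
t_crit <- qt(1 - (1 - level) / 2, df)

lower <- estimate - t_crit * std_error
upper <- estimate + t_crit * std_error
print(c(lower = lower, upper = upper))
```

The result should match the Temperature row of the Seoul confint() output to about five decimal places.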
In my opinion, the confidence intervals are not the most reliable source for the given situation, because we are dealing with variables controlled by the weather. We cannot predict with little uncertainty what the weather conditions will be tomorrow or in a year; the weather changes quickly and is unpredictable. For this reason, even if we simulate such a process, we cannot know with 97% certainty that the values will actually fall in those intervals. The population parameters might look the same as our sample parameters, but they might also look very different because the conditions might be very different.
We will use our linear models to predict future numbers. We first need to create the data we want the prediction to be based on, so we create a data.frame called predictData that holds the future values, and then pass it to the predict() function. For the interval argument we use the value “prediction”, because we are predicting individual future values, which are uncertain. We cannot predict future values with certainty since we don’t know what might happen. A prediction interval is wider than a confidence interval, because it accounts for the scatter of individual observations as well as the uncertainty in the fitted mean. In my opinion it is the safer option because you plan for the worst. A confidence interval would give a narrower range, and future values would be more likely to fall outside it. Additionally, our R-squared values are low, which means that our explanatory variables do not explain the response variable as well as we would like. Since they are under 0.7, our linear models are not very good.
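The difference between the two interval types can be seen on a small example. The sketch below uses R's built-in mtcars data as a stand-in (an assumption for illustration, not the bike data) and compares the widths of a 90% confidence interval and a 90% prediction interval for the same hypothetical new observation.

```r
# Stand-in data: the built-in mtcars data set, NOT the bike data.
fit <- lm(mpg ~ wt, data = mtcars)
new_car <- data.frame(wt = 3)   # hypothetical new observation

conf_int <- predict(fit, newdata = new_car, level = 0.90, interval = "confidence")
pred_int <- predict(fit, newdata = new_car, level = 0.90, interval = "prediction")

# Width of each interval: upper limit minus lower limit.
conf_width <- conf_int[1, "upr"] - conf_int[1, "lwr"]
pred_width <- pred_int[1, "upr"] - pred_int[1, "lwr"]
print(c(confidence = conf_width, prediction = pred_width))
```

The prediction interval is the wider of the two, because it adds the residual scatter of a single observation to the uncertainty of the fitted mean.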
## Need to create the data that we want our prediction to be based on
predictData <- data.frame(Season = "Winter", ## Assigning the data we want to make a prediction on.
                          Temperature = 0.0,
                          Humidity = 20.0,
                          WindSpeed = 0.5)
predict(linear_model_log_seoul, ## Creating the prediction based on the linear model, with level 90%.
        newdata = predictData,
        level = 0.90,
        interval = "prediction") ## Using "prediction" instead of "confidence" to have wider ranges since we are predicting data from random experiments.
## fit lwr upr
## 1 5.913404 4.5512 7.275607
For Seoul, the prediction gives a fitted value of 5.913404 with a 90% prediction interval from 4.5512 to 7.275607 when Season is “Winter”, Temperature is 0 degrees Celsius, Humidity is 20% and Wind Speed is 0.5 m/s. Because the model was fitted to log(Count), these values are on the log scale and must be exponentiated to obtain a number of bikes.
predict(linear_model_log_wash,
newdata = predictData,
level = 0.90,
interval = "prediction")
## fit lwr upr
## 1 4.276058 2.19759 6.354526
For Washington, the prediction gives a fitted value of 4.276058 with a 90% prediction interval from 2.19759 to 6.354526 for the same conditions: Season “Winter”, Temperature 0 degrees Celsius, Humidity 20% and Wind Speed 0.5 m/s. These values are also on the log scale and must be exponentiated to obtain a number of bikes.
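Since both models predict log(Count), the outputs above can be converted back to bike counts with exp(). The sketch below does this with the values copied from the two predict() outputs; the object names are hypothetical.

```r
# Fitted values and 90% prediction limits copied from the two outputs above
# (log(Count) scale).
log_predictions <- rbind(
  Seoul      = c(fit = 5.913404, lwr = 4.5512,  upr = 7.275607),
  Washington = c(fit = 4.276058, lwr = 2.19759, upr = 6.354526)
)

# exp() undoes the log() applied to Count in the model formula.
count_predictions <- exp(log_predictions)
print(round(count_predictions))
```

On the count scale this is about 370 bikes for Seoul (roughly 95 to 1,445) and about 72 bikes for Washington (roughly 9 to 575). The back-transformed fitted value is better read as a typical count than as an exact mean, and the wide ranges reflect the low R-squared values.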
We have reached the end of the project. At the end I append a section called Appendix, which holds the code of the functions I wrote to carry out some of the Data Wrangling steps.
## Functions that will be used in the program multiple times.
## For deleting unwanted columns.
deleteColumn <- function(dataFrame, columns){
  dataFrame <- dataFrame %>%
    select(-all_of(columns)) ## '-' means select all except the given columns; all_of() reads the names from the character vector.
  return(dataFrame)
}
## For converting a column to class Date.
## formatDate is used because for the conversion you need to specify in what format your date is in.
## Using functions that tidyverse and lubridate offer.
convertDate <- function(dataFrame, formatDate){
  dataFrame <- dataFrame %>%
    mutate(Date = as_date(parse_date_time(Date, formatDate)))
  return(dataFrame)
}
## For creating the new column FullDate, holding the Date and Hour of the day.
## as.integer(format(Date, format="%Y")) extracts the wanted part of column Date:
## "%Y" tells format() to return the year as a character string, and as.integer()
## converts it to the integer that the year argument expects.
## Month and day work the same way.
createFullDate <- function(dataFrame){
  dataFrame <- dataFrame %>%
    mutate(FullDate = make_datetime(year = as.integer(format(Date, format = "%Y")),
                                    month = as.integer(format(Date, format = "%m")),
                                    day = as.integer(format(Date, format = "%d")),
                                    hour = Hour,
                                    min = 0,
                                    sec = 0))
  return(dataFrame)
}